Academic Open Internet Journal

www.acadjournal.com

Volume 12, 2004

 

 

 

A Simple Algorithm for Minimal Search Data Index: Active Partitioned Data Search

Using PVM - Algorithm that battles with delay

 

N. Rajkumar1, Dr.S.N. Sivanandam2,

1 Research Scholar, Dept. of Computer Science & Engineering

2 HOD, Dept. of Computer Science & engineering

PSG College of Technology

Coimbatore – 641 004, India

Email:rnalliah@Yahoo.co.in

Abstract:

            In this paper, we examine the issue of mining association rules among items of a large database of search engines. The mining of association rules can be mapped into the problem of discovering large item sets where a large item set is a group of items which appear in a sufficient number of transactions. The problem of discovering large itemsets can be solved by constructing an index set of itemsets first and then, identifying, within the subdivided set. Generally this is done iteratively for each large Lk itemsets are equally partitioned into Sk subsets. To determine the related data of a searching text each index itemset contains the two candidate keys. One is moral text and another one is key that indicates the subset name where the data is originally located. To address this issue, we propose the parallel algorithm for “Minimal Search Data Index(MSDI)” based on the Active Partition Data Search (APDS). Extensive simulation study is conducted to evaluate performance of the proposed algorithm

Keywords: Parallel Algorithm, Parallel Virtual Machine, DataMining
Introduction

1.1  Parallel Virtual Machine

 

The system called Parallel Virtual Machine (PVM) allows a heterogeneous collection of workstations and computers to construct a parallel processing environment. PVM is portable and runs on a wide variety of modern systems. The PVM system permits a heterogeneous collection of UNIX computers networked together to be viewed by a user program as a single parallel environment with some distinctive features:

A PVM computational model is shown in Figure 1.1. The dotted lines indicate integer component communication and synchronization, and the solid lines indicate the inter-instance communication and synchronization.

 

Input and Partitioning

 

 

 

 

 


                                                                    

 

 

 

 


Fig 1.1: PVM Computational Model

PVM tasks may possess arbitrary control and dependency structures. In other words, at any point in the execution of a concurrent application, any task in existence may start or stop other tasks or add or delete computers from the virtual machine. Any process may communicate and/or synchronize with any other. Any specific control and dependency structure may be implemented under the PVM system by appropriate use of PVM constructs and host language control-flow statements.

Owing to its ubiquitous nature (specifically, the virtual machine concept) and also because of its simple but complete programming interface, the PVM system has gained widespread acceptance in the high –performance scientific computing community.     

PVM Packing Algorithm

/*database elements added into the PVM by means of n X n matrix format */

foreach transactions rowi Î SIZE do begin /*SIZE may be network capacity */

foreach transactions colj Î SIZE do begin

            a[rowi][col j ] = SIZE/20                     /* divide by means of packet size */

            pvm_spawn(testdrive”,(char**)0,pvmTestDefault,”“, NPROCS,task_id);
            /* Call the test drive for data distribution */

            /* NPROCS can be a SIZE of the subset */

            rowi ++;

            colj ++;

end

end     

foreach roxi Î NPROCS do begin

            /* Connection Establishment */

            pvm_initsend (PvmDataDefault);

            /* Data type Implementation */

            pvm_pkint (&num_data,1,1);

            /* Packing a specified data set */

            pvm_pkint (&a[num_data*rowi] [num_data*colj],num_data,1);

            /* Ready for Data Transmission */

            pvm_send (tast_id[rowi],1);

end

Receiving End

foreach rowi Î NPROCS do begin

            /* Recovery of specified task */

            pvm-recv (task_id[I],2);

            /* unpack an item for reading */

            pvm_upkint (&result [rowi],1,1);

            /* Area that is used to substitute a parallel algorithm */

end

1.2  System study

When we deeply think about recent trends of information technology, Database and its mining concept plays a very important role. From the bottom level “shop” to top level research areas, there is a database everywhere.

            But how to manage these huge databases is a question. For that “Distributive Rules” of data mining give the initial solution to use a candidate key instead of searching an entire database.

            Based on this idea, we proposed the datamining “distributive rules” based algorithm called Minimal Search Data Index (MSDI). In that algorithm Index itemset plays a very important role.

            Various algorithms have been proposed to discover the large itemsets. Generally, these algorithms first construct a candidate set of large itemsets based on some heuristics, and then discover the subset that indeed contains large itemsets.

            In this paper, we propose an algorithm MSDI for time reduced data search. It is very efficient in the filed of very large database management system. Here APDS algorithm is the backbone connectivity for this MSDI algorithm.

Searching For:

 

 

Applications of Data Mining

 
 

 

 


Bypassed Link

 

Ref Key I

 Data Mining, DSIII

 

Ref Key II

 

Ref Key III

 

Ref Key IV

 

 

Data Set III

(25 lakhs Data)

Eg. Data Mining

 

 

 

  DATA INDEX

 

 

Data Set IV

(25 lakhs Data)

 

 

Data Set II

(25 lakhs Data)

 

 

Data Set I

(25 lakhs Data)

 

 

 

 

 

 

Huge Database

(More than 1Crore Data)

 

Search

 
                

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Fig 1.2 MSDI Architecture

 
 

 

 

 

 

 


/*Ln= Very Large Database Variable*/

 

/* Set all the database index point to BOF */

 

foreach data datasub Î SIZE do begin

 

            forall transaction ip Π n do begin

                        check for data occurrences

                        if (ip eq Li ) then do begin

                                     Raise a flag variable with different signals

                        end

            end

            data ++;

end

 

/*Li = Node that is connected with flag variable */

 

if (flv=GREEN) then do begin

while (| Li[ind] | != ‘  ‘ or |Li[ind]!=’\t’)

            Lq=Li++;

process (L); /* Overall Index Value */

 

forall transactions LqΠ flv do begin

            if (Lq != ‘\0’) then do begin

                        s[Ldats ]=getc(ind[Ldats]

                        Ldats ++

            end

end

 

Lqf = File name that is red from the index page

 

fp=fopen(Lqf,”r”)

 

While((g=getc(fp))!=EOF)

            Display the Result Page

 

Descriptive Algorithm

 

Case I :

while (| Li[ind] | != ‘  ‘ or |Li[ind]!=’\t’)

If exists, raise the flag as green, else raise the red flag.

 

            Open the green flaged datasets and read it until the null value exists

 

                        if (Lq != ‘\0’)

 

            Open the file with read mode, display the contents simultaneously in paging format.

 

Case 2:

 

 

 

 

 

Data set I

Pascal Mode Generation --------------------------------

 

Moral Page

(Data Mining)

 

 

 

 

 

Index Page

 

Moral                          Dataset

 

QAS                            DS4

 

Security                    DS2

 

Data Mining              DS3

 

CSMA/CD                 DS4

 
                       

 

 

 


Data set IV

CSMA/CD-----------QAS -------------------------------------------

 

Data set III

Data Mining ---------------------------------------------------------

 
                     

 

 

 

 

 

 

 

 

 

 

 

 

 

Fig 1.3 .Busy Establishment of Candidate and Index key that indicate suitable subset in a large database

Comparision Experiments

Values in Milli Seconds.

Subset

MSDI

APDS

Dataset Size

S1

(Index Page Generation)

 

0.25

 

------

 

1,27,500 items

S2

(Subset 2)

 

-----

 

0.26

 

-----Do--

S3

(Subset 3)

 

------

 

0.252

 

----- Do ----

S4

(Subset 4)

 

0.13

 

0.252

 

Total

0.38

0.779

 

 

Table 1.1 Comparison of APDS and MSDI in terms of execution time

Fig 1.4 Comparison Chart

 

            Above table shows the relative performance between MSDI and APDS. Here, we use | T|=15, i.e., each transaction has 15 double items on an average, so as to have more large itemsets in later passes for interest of presentation. The execution time of these two algorithms is shown in figure. In the phase 4 of both the algorithms are the same. Phase 4 partitioned the database from very large database. But the other phases are entirely different when compared to MSDI with APDS. Every result may result in different microseconds. The tabulation with the two different database sizes. But when we go to the execution time, the time can be reduced in a large data set when compared to small data set. As mentioned before, in the MSDI algorithm has the options of switching from APDS to MSDI after early passes for better performance, and such an option is not adopted here. It can, nevertheless, be seen from figure that the execution time of the APDS is larger than the total execution time by MSDI.It can be seen from table that the execution time of the first pass of MSDI is additional step compared with APDS. However, MSDI incurs significantly smaller execution times than APDS in later passes, not only in the second pass through out the table.

 

 

 

Scaleup Experiments for MSDI

            The increased database sized graph shows that the execution time of MSDI increases linearly as the database size increases, meaning that MSDI possesses the same important feature as APDS. Also, we examine the performance of MSDI as the number of items, N, increases. Table shows the execution times of MSDI when the number of item increases from 1, 00,000 to 1, 50,000 for these data sets 1.14 msec and 0.94 msec respectively. In other words, a large transaction has a larger likelihood of having large itemsets to process than a small transaction. Also, given a fixed minimum support, when the number of items N increases, the execution time to obtain L, increases since the size of L1 is usually close to N1 but the execution time to obtain larger k-itemsets decreases since the support for an itemset is averaged out by more items and then decreases. Consequently, as shown in table, when N increases, execution time for small transactions increase a little more prominently than those for large transactions. Performance of MSDI when the number of items increases

 

N

Index

Reference

1,90,000

0.12Msec

0.75 Msec

3,00,000

0.09Msec

0.32 Msec

4,50,000

0.06Msec

0.14 Msec

 

Conclusion

            We examined in this paper, the issue of mining distributive rules among various items in a very large database of search engine transactions. The problem of handling large itemsets was solved by constructing an indexed set and then subdivided the large database into number of small scale database. We proposed the algorithm MSDI is especially effective for the identification of specific item in a very large database. We compare this algorithm with APDS algorithm and it concluded that the MSDI algorithm is better than APDS in case of data items increases. Extensive simulation study has been concluded to evaluate performance of the proposed algorithm and that simulation guides “MSDI is the algorithm that battles with delay”.

REFERENCES

 

1)     R. Agrawal, T. Imielinski, and A. Swami, “Mining Association Rules Between Sets of Items in Large Databases,” Proc. 1993 ACM SIGMOD Int’l Conf. Management of Data, pp. 207-216, Wahington, D.C., May 1993

 

2)     R. Agrawal and J.C. Shafer, “Parallel Mining of Association Rules: Design, Implementation, and Experience,” IEEE Trans. Knowledge and Data Eng., vol.8, pp. 962-969,1996

 

3)     R. Agrawal and R.Srikant, “Fast Algorithms for Mining Association Rules,” Proc. 1994 Int’lconf. Very Large Data Bases, pp. 487-499, Santiago, Chile, Sept. 1994.

 

4)     R. Agrawal and R. Srikant, “Mining Sequential Patterns,” Proc. 1995 Int’l Conf. Data Eng., pp.265-276,Tucson,Ariz., May 1997.

 

5)     S. Brain, R. Motwani, and C.Silverstein, “Beyond Market Basket: Generalizing Association Rules to Correlations,” Proc. 1997 ACM SIGNMOD Int’l Conf. Management of Data, pp. 265-276, Tucson, Ariz., May 1997

 

6)     S. Chaudhuri and U. Dayal, “An Overview of Data Warehousing and PLAP Technology,”AGM SIGMOD Record, vol. 26, pp. 65-74, 1997.

 

7)     M.S. Chen, J. Han, and P.S. Yu, “Data Mining: An overview from a Database Perspective,” IEEETrans. Knowledge and Data Engg, Vol.8, PP.866-883, 1996.

 

8)     D.W. Cheung, J. Han, V. Ng,A. Fu, and Y. Fu, “A Fast Distributed Algorithm for Mining Association Rules,” Proc. 1996 Int’l Conf. Parallel and Distributed Information Systems, PP. 1996 Int’l Conf. Data Enf., PP. 106-114, New Orleans, Feb. 1996.

 

9)     Y. Fu and J. Han, V. Ng, A. Fu, and Y. Fu, “A Fast Distributed Algorithm for Mining Association Rules,” Proc. 1996 Int’l Conf. Parallel and Distributed Information Systems, PP. 31-44, Miami Beach, Fla.,Dec. 1996.

 

10) D.W. Cheung, J. Han, V. Ng, and C.Y. Wong, “Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique,” Proc. 1996 Int’l Conf, Data Engg. PP. 106-114, New Orleans, Feb. 1996.

 

 

 

Technical College - Bourgas,

All rights reserved, © March, 2000